Load packages

If you are using these packages for the first time, you will first have to install them.

# install.packages("survival") 
# install.packages("memisc")
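To see whether the install step is needed at all, you can ask R which of the two packages are already available. The following is a minimal sketch using requireNamespace() from base R; the object names pkgs and is_installed are only illustrative.

```r
# Packages used in this practical
pkgs <- c("survival", "memisc")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Install only what is missing (uncomment to run)
# install.packages(pkgs[!is_installed])
```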

If you have already installed these packages for your current version of R, you only have to load them.

library(survival)
## Warning: package 'survival' was built under R version 4.0.4
library(memisc)
## Warning: package 'memisc' was built under R version 4.0.3
## Loading required package: lattice
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 4.0.4
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array

Get the data

Load a data set from a package.
You can use the double colon operator (::) to return the pbc object from the package survival. We store this data set in an object with the name pbc.

pbc <- survival::pbc
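The :: operator works for any exported object of a package, not only for data sets. As a small illustration with the stats package that ships with R (the vector x is made up for this example):

```r
x <- c(1, 3, 5, 7)

# Calling a function via its package name ...
stats::median(x)

# ... gives the same result as calling it directly once the package is attached
median(x)
```

Writing survival::pbc in the same way makes it explicit where pbc comes from, which helps when several loaded packages contain objects with the same name.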

Common questions that can be answered in R

Continuous data

What is the mean and standard deviation for the variable age of the pbc data set?

mean(x = pbc$age)
## [1] 50.74155
mean(x = pbc$age, na.rm = TRUE)
## [1] 50.74155
sd(x = pbc$age)
## [1] 10.44721

What is the mean and variance for the variable chol of the pbc data set?

mean(x = pbc$chol)
## [1] NA
mean(x = pbc$chol, na.rm = TRUE)
## [1] 369.5106
var(x = pbc$chol, na.rm = TRUE)
## [1] 53798.27
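The NA returned by mean(x = pbc$chol) means that the result is unknown because some cholesterol values are missing. A small sketch with a made-up vector x shows the effect of na.rm = TRUE:

```r
x <- c(2, 4, NA, 6)

mean(x)                # NA: one value is missing
mean(x, na.rm = TRUE)  # 4: mean of 2, 4 and 6
var(x, na.rm = TRUE)   # 4
sd(x, na.rm = TRUE)    # 2: square root of the variance
```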

What is the median and interquartile range for the variable age of the pbc data set?

median(x = pbc$age)
## [1] 51.00068
IQR(x = pbc$age)
## [1] 15.40862
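The interquartile range is the distance between the 75th and the 25th percentile. The sketch below checks this on a made-up vector x:

```r
x <- 1:9

q <- quantile(x, probs = c(0.25, 0.75))
q                # 25%: 3, 75%: 7
unname(diff(q))  # 4
IQR(x)           # 4, the same value
```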

What is the min and max of the variable age of the pbc data set?

min(x = pbc$age)
## [1] 26.27789
max(x = pbc$age)
## [1] 78.43943
range(x = pbc$age)
## [1] 26.27789 78.43943

What are the 10th, 25th, 50th, 75th and 90th percentiles for serum bilirubin of the pbc data set?

quantile(x = pbc$bili, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))
##  10%  25%  50%  75%  90% 
## 0.60 0.80 1.40 3.40 8.03
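When a requested percentile falls between two observations, quantile() interpolates between them (by default with type = 7, a linear interpolation). A short sketch on a made-up vector x, together with summary(), which reports several of these statistics in one call:

```r
x <- c(1, 2, 3, 4, 10)

# 10th, 50th and 90th percentiles: 1.4, 3 and 7.6
q <- quantile(x, probs = c(0.1, 0.5, 0.9))
q

# Minimum, quartiles, mean and maximum in one call
summary(x)
```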

The functions colMeans() and rowMeans() allow us to calculate the mean for each column or row in a matrix or data.frame, e.g.:

colMeans(x = data.frame(bili = pbc$bili, chol = pbc$chol), na.rm = TRUE)
##       bili       chol 
##   3.220813 369.510563
rowMeans(x = data.frame(bili = pbc$bili, chol = pbc$chol), na.rm = TRUE)
##   [1] 137.75 151.55  88.70 122.90 141.20 124.40 161.50 140.15 282.60 106.30
##  [11] 130.20 119.80 140.85   0.80 115.90 102.35 138.35  94.70 117.85 189.55
##  [21] 126.30 137.20 206.20 229.05 149.35 566.60  98.30 119.60 185.35 131.80
##  [31] 150.35 131.90 105.40 182.40 157.60  86.15 170.55 193.15 141.35   1.30
##  [41]   6.80   2.10 181.05 151.15   0.60 243.85 158.25 130.45   0.80 129.05
##  [51] 138.40 310.00   2.60 144.65 208.90 249.55 131.15 121.35 164.90 302.45
##  [61] 108.30 151.65 477.25 187.55 128.60 214.20 233.55  87.35 336.00   0.60
##  [71] 129.60 160.25  66.35 283.20 345.55 203.10 125.30 221.15 157.90 127.10
##  [81] 231.20 238.25 125.65 131.70 132.05 802.50 173.05 148.30 205.00 330.80
##  [91] 165.00 103.70 177.15 102.10  17.40   1.00 211.00 120.00 230.90  90.15
## [101] 200.45 124.45  95.25 152.05 232.55   2.10 106.30  63.70  60.25 243.95
## [111] 266.75 134.50 190.35 131.10 151.85 230.50 478.25 196.75 318.30 164.25
## [121]  76.15 149.30   5.10 125.80 158.65 135.10 134.25  16.20 210.45 896.20
## [131] 122.40 224.95 166.25 289.35 131.70 131.90 200.05 216.65 164.55 145.55
## [141] 173.45 182.50 167.45 292.00 154.85   1.20 144.60 511.10 130.00   1.00
## [151] 230.45 294.15 108.75  85.20 110.30 191.75 143.30 226.70 159.75 108.80
## [161] 252.15 131.60 116.65   8.50 100.00 742.85 188.45 128.70 204.65 195.60
## [171]   0.50 103.15 119.50   0.50 141.90   3.20 129.45   0.60 198.90 241.35
## [181] 124.70   0.60 100.75 342.50 128.40 113.50 411.00  93.85 180.65   2.30
## [191] 558.25 154.45 471.40 147.25 175.35 113.70 133.30 143.35 197.05 120.35
## [201] 117.80 111.75  74.75 127.85 192.25 106.80   0.60 199.95 126.35 173.45
## [211]   1.30 116.60 200.25 202.45 640.95   0.50 309.70   0.50 108.30 214.90
## [221] 180.45 188.25 231.05 155.00 137.35 111.75 159.15 107.85  97.75 152.65
## [231] 260.70 133.70 257.45 289.45 674.50 127.25 221.80 140.30 150.40 116.20
## [241] 160.20 177.95 238.00 176.95 136.80 194.55 859.05 162.40 121.65 149.80
## [251] 113.75 123.55 125.05 115.05  96.85 168.55 140.25 207.55 140.05 118.80
## [261] 189.10 162.40 216.55 179.70 175.75 159.25 114.30 175.20 188.80 224.50
## [271] 161.00 113.25 165.10   1.60 287.10 110.00 159.00 171.80  99.25 163.30
## [281]  96.45 152.65 206.55 146.15 126.90 156.00 189.70 159.35 210.00 147.70
## [291] 171.10 277.30 101.25 503.30 324.20 164.40 138.10 170.55 172.20   5.20
## [301] 197.00 167.85 186.50 109.75 214.45 119.80 136.90 123.20 130.20 217.85
## [311] 124.50 291.20   0.70   1.40   0.70   0.70   0.80   0.70   5.00   0.40
## [321]   1.30   1.10   0.60   0.60   1.80   1.50   1.20   1.00   0.70   3.50
## [331]   3.10  12.60   2.80   7.10   0.60   2.10   1.80  16.00   0.60   5.40
## [341]   9.00   0.90  11.10   8.90   0.50   0.60   3.40   0.90   1.40   2.10
## [351]  15.00   0.60   1.30   1.30   1.60   2.20   3.00   0.80   0.80   1.80
## [361]   5.50  18.00   0.60   2.70   0.90   1.30   1.10  13.80   4.40  16.00
## [371]   7.30   0.60   0.70   0.70   1.70   9.50   2.20   1.80   3.30   2.90
## [381]   1.70  14.00   0.80   1.30   0.70   1.70  13.60   0.90   0.70   3.00
## [391]   1.20   0.40   0.70   2.00   1.40   1.60   0.50   7.30   8.10   0.50
## [401]   4.20   0.80   2.50   4.60   1.00   4.50   1.10   1.90   0.70   1.50
## [411]   0.60   1.00   0.70   1.20   0.90   1.60   0.80   0.70
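Several values in the rowMeans() output are very small (e.g. 0.80 for row 14) because chol is missing for those patients, so with na.rm = TRUE the row mean is based on bili alone. A sketch with a made-up data frame d illustrates how missing values are handled:

```r
d <- data.frame(a = c(1, 2, NA),
                b = c(3, NA, NA))

colMeans(d, na.rm = TRUE)  # a: 1.5, b: 3
rowMeans(d, na.rm = TRUE)  # 2, 2, NaN (row 3 has no observed value)

# Without na.rm, any row containing a missing value gives NA
rowMeans(d)                # 2, NA, NA
```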

The functions colSums() and rowSums() allow us to calculate the sum for each column or row in a matrix or data.frame, e.g.:

colSums(x = data.frame(bili = pbc$bili, chol = pbc$chol), na.rm = TRUE)
##     bili     chol 
##   1346.3 104941.0
rowSums(x = data.frame(bili = pbc$bili, chol = pbc$chol), na.rm = TRUE)
##   [1]  275.5  303.1  177.4  245.8  282.4  248.8  323.0  280.3  565.2  212.6
##  [11]  260.4  239.6  281.7    0.8  231.8  204.7  276.7  189.4  235.7  379.1
##  [21]  252.6  274.4  412.4  458.1  298.7 1133.2  196.6  239.2  370.7  263.6
##  [31]  300.7  263.8  210.8  364.8  315.2  172.3  341.1  386.3  282.7    1.3
##  [41]    6.8    2.1  362.1  302.3    0.6  487.7  316.5  260.9    0.8  258.1
##  [51]  276.8  620.0    2.6  289.3  417.8  499.1  262.3  242.7  329.8  604.9
##  [61]  216.6  303.3  954.5  375.1  257.2  428.4  467.1  174.7  672.0    0.6
##  [71]  259.2  320.5  132.7  566.4  691.1  406.2  250.6  442.3  315.8  254.2
##  [81]  462.4  476.5  251.3  263.4  264.1 1605.0  346.1  296.6  410.0  661.6
##  [91]  330.0  207.4  354.3  204.2   17.4    1.0  422.0  240.0  461.8  180.3
## [101]  400.9  248.9  190.5  304.1  465.1    2.1  212.6  127.4  120.5  487.9
## [111]  533.5  269.0  380.7  262.2  303.7  461.0  956.5  393.5  636.6  328.5
## [121]  152.3  298.6    5.1  251.6  317.3  270.2  268.5   16.2  420.9 1792.4
## [131]  244.8  449.9  332.5  578.7  263.4  263.8  400.1  433.3  329.1  291.1
## [141]  346.9  365.0  334.9  584.0  309.7    1.2  289.2 1022.2  260.0    1.0
## [151]  460.9  588.3  217.5  170.4  220.6  383.5  286.6  453.4  319.5  217.6
## [161]  504.3  263.2  233.3    8.5  200.0 1485.7  376.9  257.4  409.3  391.2
## [171]    0.5  206.3  239.0    0.5  283.8    3.2  258.9    0.6  397.8  482.7
## [181]  249.4    0.6  201.5  685.0  256.8  227.0  822.0  187.7  361.3    2.3
## [191] 1116.5  308.9  942.8  294.5  350.7  227.4  266.6  286.7  394.1  240.7
## [201]  235.6  223.5  149.5  255.7  384.5  213.6    0.6  399.9  252.7  346.9
## [211]    1.3  233.2  400.5  404.9 1281.9    0.5  619.4    0.5  216.6  429.8
## [221]  360.9  376.5  462.1  310.0  274.7  223.5  318.3  215.7  195.5  305.3
## [231]  521.4  267.4  514.9  578.9 1349.0  254.5  443.6  280.6  300.8  232.4
## [241]  320.4  355.9  476.0  353.9  273.6  389.1 1718.1  324.8  243.3  299.6
## [251]  227.5  247.1  250.1  230.1  193.7  337.1  280.5  415.1  280.1  237.6
## [261]  378.2  324.8  433.1  359.4  351.5  318.5  228.6  350.4  377.6  449.0
## [271]  322.0  226.5  330.2    1.6  574.2  220.0  318.0  343.6  198.5  326.6
## [281]  192.9  305.3  413.1  292.3  253.8  312.0  379.4  318.7  420.0  295.4
## [291]  342.2  554.6  202.5 1006.6  648.4  328.8  276.2  341.1  344.4    5.2
## [301]  394.0  335.7  373.0  219.5  428.9  239.6  273.8  246.4  260.4  435.7
## [311]  249.0  582.4    0.7    1.4    0.7    0.7    0.8    0.7    5.0    0.4
## [321]    1.3    1.1    0.6    0.6    1.8    1.5    1.2    1.0    0.7    3.5
## [331]    3.1   12.6    2.8    7.1    0.6    2.1    1.8   16.0    0.6    5.4
## [341]    9.0    0.9   11.1    8.9    0.5    0.6    3.4    0.9    1.4    2.1
## [351]   15.0    0.6    1.3    1.3    1.6    2.2    3.0    0.8    0.8    1.8
## [361]    5.5   18.0    0.6    2.7    0.9    1.3    1.1   13.8    4.4   16.0
## [371]    7.3    0.6    0.7    0.7    1.7    9.5    2.2    1.8    3.3    2.9
## [381]    1.7   14.0    0.8    1.3    0.7    1.7   13.6    0.9    0.7    3.0
## [391]    1.2    0.4    0.7    2.0    1.4    1.6    0.5    7.3    8.1    0.5
## [401]    4.2    0.8    2.5    4.6    1.0    4.5    1.1    1.9    0.7    1.5
## [411]    0.6    1.0    0.7    1.2    0.9    1.6    0.8    0.7
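Sums and means are linked through the number of non-missing values: dividing a row sum by the count of observed entries gives the row mean. A sketch with a made-up data frame d, where rowSums() is also used to count the observed values per row:

```r
d <- data.frame(a = c(1, 2, NA),
                b = c(3, NA, NA))

row_total <- rowSums(d, na.rm = TRUE)  # 4, 2, 0 (an all-missing row sums to 0)
n_observed <- rowSums(!is.na(d))       # 2, 1, 0 observed values per row

row_total / n_observed                 # 2, 2, NaN
rowMeans(d, na.rm = TRUE)              # same result
```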

What is the correlation between serum bilirubin and serum cholesterol of the pbc data set?

cor(x = pbc$bili, y = pbc$chol, use = "complete.obs", method = "pearson")
## [1] 0.3971289
cor(x = pbc$bili, y = pbc$chol, use = "complete.obs", method = "spearman")
## [1] 0.4024538
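Spearman's correlation is Pearson's correlation computed on the ranks of the data, which makes it less sensitive to extreme values. A sketch with made-up vectors x and y:

```r
x <- c(1, 2, 3, 4, 100)
y <- c(2, 1, 4, 3, 5)

cor(x, y, method = "pearson")
cor(x, y, method = "spearman")             # 0.8

# Pearson correlation of the ranks: same value as the Spearman correlation
cor(rank(x), rank(y), method = "pearson")  # 0.8
```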

What is the correlation matrix for the variables serum bilirubin, serum cholesterol and albumin of the pbc data set?

cor(x = data.frame(pbc$bili, pbc$chol, pbc$albumin), 
    use = "complete.obs")
##               pbc.bili    pbc.chol pbc.albumin
## pbc.bili     1.0000000  0.39712889 -0.31310846
## pbc.chol     0.3971289  1.00000000 -0.06973277
## pbc.albumin -0.3131085 -0.06973277  1.00000000
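With use = "complete.obs", every row that has a missing value in any of the variables is dropped before the correlations are computed. The option "pairwise.complete.obs" instead uses, for each pair of variables, all rows where both are observed. A sketch with a made-up data frame d:

```r
d <- data.frame(u = c(1, 2, 3, 4, 5),
                v = c(2, 1, 4, NA, 5),
                w = c(5, 3, 4, 1, NA))

# Only rows 1 to 3 are complete, so all correlations use these 3 rows
r_complete <- cor(d, use = "complete.obs")
r_complete

# The u-v correlation now uses rows 1, 2, 3 and 5
r_pairwise <- cor(d, use = "pairwise.complete.obs")
r_pairwise
```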

What is the variance-covariance matrix for the above variables?

var(x = data.frame(pbc$bili, pbc$chol, pbc$albumin), 
    use = "complete.obs")
##                pbc.bili     pbc.chol pbc.albumin
## pbc.bili     20.7116508   419.201667  -0.5752636
## pbc.chol    419.2016672 53798.271973  -6.5295880
## pbc.albumin  -0.5752636    -6.529588   0.1629782
cov(x = data.frame(pbc$bili, pbc$chol, pbc$albumin), 
    use = "complete.obs")
##                pbc.bili     pbc.chol pbc.albumin
## pbc.bili     20.7116508   419.201667  -0.5752636
## pbc.chol    419.2016672 53798.271973  -6.5295880
## pbc.albumin  -0.5752636    -6.529588   0.1629782

A (co)variance matrix can be converted to a (Pearson) correlation matrix with the help of the function cov2cor():

cov2cor(V = var(x = data.frame(pbc$bili, pbc$chol, pbc$albumin),
                use = "complete.obs"))
##               pbc.bili    pbc.chol pbc.albumin
## pbc.bili     1.0000000  0.39712889 -0.31310846
## pbc.chol     0.3971289  1.00000000 -0.06973277
## pbc.albumin -0.3131085 -0.06973277  1.00000000
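cov2cor() divides each covariance by the product of the two corresponding standard deviations. The sketch below does the same by hand on a made-up data frame d:

```r
d <- data.frame(u = c(1, 2, 3, 4, 5),
                v = c(2, 1, 4, 3, 6),
                w = c(9, 7, 8, 4, 5))

V <- cov(d)
s <- sqrt(diag(V))         # standard deviations of u, v and w

by_hand <- V / outer(s, s) # cov(i, j) / (sd(i) * sd(j))
by_hand
cov2cor(V)                 # same result
cor(d)                     # same result
```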

Categorical data

What is the percentage of placebo and treatment patients in the pbc data set? (In order to use the percent() function you will need to load the memisc package)

percent(x = pbc$trt)
##         1         2         N 
##  50.64103  49.35897 312.00000

What is the percentage of females and males in the pbc data set?

percent(x = pbc$sex)
##         m         f         N 
##  10.52632  89.47368 418.00000
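percent() comes from the memisc package. A similar result can be obtained in base R by combining table() and prop.table(). A sketch with a made-up factor g (note that table() drops missing values by default):

```r
g <- factor(c("a", "b", "b", NA, "b"))

counts <- table(g)                 # a: 1, b: 3 (the NA is not counted)
pct <- 100 * prop.table(counts)    # a: 25, b: 75
pct
sum(counts)                        # 4 non-missing observations
```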

What are the frequencies of each combination for the variables trt and sex in the pbc data set?

table(trt = pbc$trt, sex = pbc$sex)
##    sex
## trt   m   f
##   1  21 137
##   2  15 139

To add summaries (e.g. the sum) for each column and/or row use the addmargins() function:

tab <- table(trt = pbc$trt, sex = pbc$sex)
addmargins(A = tab)
##      sex
## trt     m   f Sum
##   1    21 137 158
##   2    15 139 154
##   Sum  36 276 312

We can also change the function (e.g. use the mean):

addmargins(A = tab, FUN = mean)
## Margins computed over dimensions
## in the following order:
## 1: trt
## 2: sex
##       sex
## trt      m   f mean
##   1     21 137   79
##   2     15 139   77
##   mean  18 138   78
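By default addmargins() adds a margin for every dimension of the table. The margin argument restricts this to selected dimensions. A sketch with a made-up 2 x 2 table tab2:

```r
tab2 <- table(group = c("a", "a", "b", "b", "b"),
              sex   = c("m", "f", "f", "f", "m"))

with_sum_row <- addmargins(A = tab2, margin = 1)  # adds a 'Sum' row
with_sum_col <- addmargins(A = tab2, margin = 2)  # adds a 'Sum' column

with_sum_row
with_sum_col
```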

What are the proportions of each combination for the variables trt and sex in the pbc data set? (Multiply by 100 to obtain percentages.)

prop.table(x = tab)
##    sex
## trt          m          f
##   1 0.06730769 0.43910256
##   2 0.04807692 0.44551282
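prop.table() divides by the grand total unless a margin is given. With margin = 1 the proportions are computed within each row, with margin = 2 within each column. A sketch with a made-up 2 x 2 table tab2:

```r
tab2 <- table(group = c("a", "a", "b", "b", "b"),
              sex   = c("m", "f", "f", "f", "m"))

prop.table(x = tab2)              # proportions of the grand total
prop.table(x = tab2, margin = 1)  # each row sums to 1
prop.table(x = tab2, margin = 2)  # each column sums to 1
100 * prop.table(x = tab2)        # percentages of the grand total
```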

For tables with more than 2 dimensions use:

ftable(x = table(trt = pbc$trt, sex = pbc$sex, ascites = pbc$ascites))
##         ascites   0   1
## trt sex                
## 1   m            20   1
##     f           124  13
## 2   m            13   2
##     f           131   8
ftable(x = data.frame(trt = pbc$trt, sex = pbc$sex, ascites = pbc$ascites))
##         ascites   0   1
## trt sex                
## 1   m            20   1
##     f           124  13
## 2   m            13   2
##     f           131   8

With the help of the arguments row.vars and col.vars we can determine which variables are given in the rows and which in the columns:

ftable(x = table(trt = pbc$trt, sex = pbc$sex, ascites = pbc$ascites), 
       row.vars = c(3, 2))
##             trt   1   2
## ascites sex            
## 0       m        20  13
##         f       124 131
## 1       m         1   2
##         f        13   8
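row.vars and col.vars also accept variable names instead of positions, which can be easier to read. A sketch using the Titanic table that ships with R (it is not part of the pbc data and is used here only for illustration):

```r
# Titanic is a 4-dimensional table: Class, Sex, Age, Survived
by_name <- ftable(x = Titanic, row.vars = c("Class", "Sex"))
by_name

# The same layout specified by position
by_position <- ftable(x = Titanic, row.vars = 1:2)
```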

Handling missing values and outliers

Check if there are any missing values in the serum cholesterol variable of the pbc data set:

is.na(pbc$chol)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [49]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [181] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [217] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [325]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [337]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [349]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [361]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [373]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [385]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [397]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [409]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
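Printing the full logical vector is hard to read for 418 patients. A sketch of more compact summaries, shown on a made-up vector x:

```r
x <- c(5, NA, 7, NA)

anyNA(x)         # TRUE: at least one value is missing
sum(is.na(x))    # 2: number of missing values
which(is.na(x))  # 2 4: positions of the missing values
mean(is.na(x))   # 0.5: proportion of missing values
```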

Check if there are any complete cases in the serum cholesterol variable of the pbc data set:

complete.cases(pbc$chol)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
##  [49] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [145]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [157]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [169]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [181]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [193]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [205]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [217]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [229]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [241]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [253]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [277]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [289]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [301]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
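complete.cases() is most useful when applied to several variables at once: a row counts as complete only if none of its values is missing. A sketch with a made-up data frame d:

```r
d <- data.frame(u = c(1, NA, 3),
                v = c("x", "y", NA))

complete.cases(d)       # TRUE FALSE FALSE
d[complete.cases(d), ]  # keeps only row 1
na.omit(d)              # keeps the same rows
```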

Obtain the dimensions of a matrix or data frame. We can use the function dim():

dim(pbc)
## [1] 418  20

Outliers: e.g. let’s assume that patients with serum bilirubin values > 25 are outliers.

Check whether there are any outliers: obtain all rows from the data set which correspond to serum bilirubin outliers:

pbc_out_bili <- pbc[pbc$bili > 25, ]

Calculate the mean and median of the serum bilirubin variable without the outliers:

pbc_no_out_bili <- pbc[pbc$bili <= 25, ]

mean(pbc_no_out_bili$bili)
## [1] 3.107692
median(pbc_no_out_bili$bili)
## [1] 1.35
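Subsetting with a logical condition works cleanly here because bili has no missing values. If the variable did contain missing values, the comparison would return NA for those rows and the result would include rows filled with NA. A sketch with a made-up data frame d, using which() or subset() as possible ways to avoid this:

```r
d <- data.frame(id = 1:4, v = c(1, 30, NA, 2))

d[d$v > 25, ]         # 2 rows: id 2 plus a row of NAs
d[which(d$v > 25), ]  # 1 row: id 2 only
subset(d, v > 25)     # 1 row: subset() also drops rows where the condition is NA

nrow(d[which(d$v > 25), ])  # 1: number of outliers found
```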

Calculate the mean and median of the serum bilirubin variable without the missing values in the serum cholesterol variable:

pbc_no_mis_chol <- pbc[complete.cases(pbc$chol), ]
mean(pbc_no_mis_chol$bili)
## [1] 3.276056
median(pbc_no_mis_chol$bili)
## [1] 1.4